
Lab: Classification and Regression Trees (CART)

This lab will introduce you to modeling using Classification and Regression Trees (CART). These trees can model continuous or categorical data but are particularly well suited for categorical response data. In R there are two CART libraries, tree and rpart. rpart is the newer of the two and is the one used by Plant, so we'll use it here.

Note that for this lab you'll need the rpart and caret libraries, and you'll use na.omit() to remove invalid values in your data.

CART for Categorical Response Variables

CART is probably most useful for situations where your response variable is categorical, such as presence/absence, dominant species, or fire severity. Your response data will need to be integers or a "factor" in R for rpart to work correctly. You may also want to explore the older "tree" library as well.

The DominnantTrees_CA CSV file, at the top of the class schedule page, contains the dominant tree species (the species with the largest number of trees) from the FIA plots in California. The data includes the common name, scientific name, and a species code (spcd) for each dominant species. The data also has environmental variables for Annual Mean Temp and Annual Mean Precip from BioClim. We can use the CommonName or spcd values to create a classification tree in R. First, we need to load the data and create the tree model. Note that we're also using "class" ("classification") for the method instead of "anova".

# Load the rpart library
library(rpart)

# Read the data
TheData = read.csv("D:\\jimg\\Classes\\GSP 570\\CART\\DominnantTrees_CA_Temp_Precip.csv")

# Remove any NA (null) values
TheData=na.omit(TheData)

# Create the tree
TheTree=rpart(CommonName~AnnualPrecip+AnnualTemp,data=TheData,method="class") # create a classification tree
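If you'd like to experiment with the workflow before downloading the class CSV file, the sketch below runs the same steps on R's built-in iris data (the variable names IrisTree and the petal predictors are just for this example, not part of the lab data):

```r
# A self-contained sketch of the classification-tree workflow using the
# built-in iris data, so no external CSV file is needed.
library(rpart)

# Species is a factor, so method="class" builds a classification tree
IrisTree = rpart(Species ~ Petal.Length + Petal.Width,
                 data = iris, method = "class")

print(IrisTree) # text version of the tree: node number, split, n, loss, yval
```

The printed output has the same structure described below for the lab data: a "root" node 1, numbered child nodes for each split, and terminal nodes marked with an asterisk.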

Then, we can plot the tree. Note that I needed to increase the margins to allow for the size of the titles on the nodes of the tree.

par(mar=c(2,5,2,5)) # margins: bottom, left, top, right
plot(TheTree) # plot the tree (without labels)
text(TheTree) # add the labels to the branches

These plots are not the most attractive, so we can also use another library to create better-looking plots. Take a look at Plotting rpart trees with the rpart.plot package for more options.

library(rpart.plot) # load the library
rpart.plot(TheTree,type=1,extra=0,box.palette=0) # plot the tree 

The summary for a classification tree gives a pruning table with results for each level of splits in the tree:

Below the pruning table is a text-based definition of the tree with results at each node. Note that this is a "code" version of the tree. It starts with the "root" of the tree at node 1. Node 2 shows a split (to the left). The tree continues until we reach terminal nodes. For each line of this version of the tree, there is also the number of data points at the node, the deviance explained at that node, and the response value (yval) for the node.

summary(TheTree) # print the pruning tree and the details for each node
printcp(TheTree) # just print the pruning tree

Each time we make a split we increase the fit of the model, but we also increase the tree's complexity. Thus we want to control the complexity of the tree. We can do this with the rpart.control() function, which can include a variable for the minimum number of values required for a split to occur (minsplit) and the complexity parameter (cp).

TheControl=rpart.control(minsplit=5,cp=0.2) #create a control object to change the model parameters
TheTree=rpart(CommonName~AnnualPrecip+AnnualTemp,data=TheData,method="class",control=TheControl) # run the model with the parameters

Try different values of minsplit and cp to see their impact on the tree.
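As a self-contained illustration of what cp does (again using the built-in iris data rather than the class CSV, and with cp values chosen just for this example), a small cp allows many splits while a large cp prunes aggressively:

```r
# Sketch: how the complexity parameter (cp) controls tree size.
library(rpart)

LooseControl  = rpart.control(minsplit = 5, cp = 0.001) # allow many splits
StrictControl = rpart.control(minsplit = 5, cp = 0.4)   # require a large improvement per split

BigTree   = rpart(Species ~ ., data = iris, method = "class", control = LooseControl)
SmallTree = rpart(Species ~ ., data = iris, method = "class", control = StrictControl)

# One row of $frame per node, so comparing row counts compares tree sizes
nrow(BigTree$frame)   # larger tree
nrow(SmallTree$frame) # fewer nodes under the stricter complexity penalty
```

A split is only kept when it improves the overall fit by at least the fraction cp, so raising cp is a direct way to trade training fit for a simpler, more interpretable tree.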

Evaluation

We can create a prediction of our data and then add it to the existing dataframe and save it as a new file to see if our values match what is expected.

ThePrediction = predict(TheTree,TheData,type="class") # the type parameter specifies we want a classification (categories) rather than continuous values

FinalData=data.frame(TheData,ThePrediction)
write.csv(FinalData, file = "D:\\jimg\\Classes\\GSP 570\\CART\\TheDataWithPrediction.csv")

Evaluating classification trees is a little different from other regression methods because we do not have a continuous response, so we cannot compute residuals. Instead, we can create a confusion matrix and then have the caret package compute overall statistics for the accuracy of the tree model.

library(caret) # load the caret library for confusionMatrix()

TheResults=table(ThePrediction, TheData$CommonName) # create the table showing correct and incorrect matches
write.csv(TheResults, file = "D:\\jimg\\Classes\\GSP 570\\CART\\TheResults.csv") # save the table to view in MS-Excel
confusionMatrix(TheResults) # print the confusion matrix and summary statistics

Scroll through the results of the confusion matrix and you'll see a block of summary statistics that includes the overall accuracy, Cohen's Kappa (see the Resources below), and per-class statistics such as sensitivity and specificity.
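To make those headline statistics less of a black box, the sketch below computes accuracy and Cohen's Kappa by hand from a small made-up confusion matrix (the table values and the "A"/"B" class labels are invented for this example):

```r
# Sketch: computing accuracy and Cohen's Kappa directly from a confusion
# matrix, using a small made-up 2x2 table.
Confusion = matrix(c(40,  5,
                     10, 45), nrow = 2, byrow = TRUE,
                   dimnames = list(Predicted = c("A", "B"),
                                   Actual    = c("A", "B")))

N = sum(Confusion)
Accuracy = sum(diag(Confusion)) / N # proportion on the diagonal (correct)

# Agreement expected by chance, from the row and column totals
Expected = sum(rowSums(Confusion) * colSums(Confusion)) / N^2
Kappa = (Accuracy - Expected) / (1 - Expected) # agreement beyond chance

Accuracy # 0.85
Kappa    # 0.7
```

Kappa discounts the agreement we would expect from guessing with the observed class frequencies, which is why it is often reported alongside raw accuracy for categorical models.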

Prediction

Now, we can create a new prediction based on the entire study area, evaluate it, and create a final predicted surface. Below is the code to create the prediction.

# Read the data
NewData = read.csv("D:\\jimg\\Classes\\GSP 570\\CART\\CA_Temp_Precip.csv")

# Remove any NA (null) values
NewData=na.omit(NewData)
ThePrediction = predict(TheTree,NewData,type="class") 
FinalData=data.frame(NewData,ThePrediction)
write.csv(FinalData, file = "D:\\jimg\\Classes\\GSP 570\\CART\\TheDataWithPrediction.csv")

Resources

There is some information on the R for Spatial Statistics web site, but I need to restructure that section and integrate the information presented here. The web page is under CART in Section 7, "Correlation/Regression Modeling".

Plant's book section 8.3

Confusion Matrix in caret

Model Evaluation Techniques for Classification models

Cohen's Kappa

Tutorial: How to Assess Model Accuracy

© Copyright 2018 HSU - All rights reserved.